A string pattern regression algorithm and its application to pattern discovery in long introns.

Authors

  • Hideo Bannai
  • Shunsuke Inenaga
  • Ayumi Shinohara
  • Masayuki Takeda
  • Satoru Miyano
Abstract

We present a new approach to pattern discovery called string pattern regression, where we are given a data set that consists of a string attribute and an objective numerical attribute. The problem is to find the best string pattern that divides the data set into two subsets: the entries whose string attribute the pattern matches, and the entries whose string attribute it does not match. The best pattern is the one for which the distributions of the numerical attribute values in these two subsets are most distinct with respect to some appropriate measure. By solving this problem, we are able to discover, at the same time, a subset of the data whose objective numerical attributes are significantly different from the rest of the data, as well as the splitting rule in the form of a string pattern that is conserved in the subset. Although the problem can be solved in linear time for the substring pattern class, the problem is NP-hard in the general case (i.e., for more complex patterns), and we present an exact but efficient branch-and-bound algorithm which is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications, such as microarray gene expression data.
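
The problem statement in the abstract can be illustrated with a small brute-force sketch for the substring pattern class. This is not the linear-time method or the branch-and-bound algorithm mentioned above; it simply enumerates candidate substrings and scores each split. The scoring measure (reduction in the sum of squared errors), the function names `sse` and `best_substring_pattern`, the `max_len` parameter, and the toy data are all assumptions made for illustration, since the abstract only says "some appropriate measure".

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_substring_pattern(data, max_len=4):
    """Brute-force search over substring patterns (illustrative only).

    data: list of (string, numeric value) pairs.
    Returns (pattern, score), where score is the reduction in the sum of
    squared errors obtained by splitting the data into strings that
    contain the pattern and strings that do not. This measure is an
    assumption; the source does not fix a particular one.
    """
    # Collect every substring of length 1..max_len as a candidate pattern.
    candidates = set()
    for s, _ in data:
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                candidates.add(s[i:j])

    total = sse([v for _, v in data])
    best_pattern, best_score = None, 0.0
    # Sorted iteration makes tie-breaking deterministic (smallest pattern wins).
    for p in sorted(candidates):
        matched = [v for s, v in data if p in s]
        unmatched = [v for s, v in data if p not in s]
        if not matched or not unmatched:
            continue  # a pattern that does not split the data is useless
        score = total - sse(matched) - sse(unmatched)
        if score > best_score:
            best_pattern, best_score = p, score
    return best_pattern, best_score


# Hypothetical toy data: short sequences with a made-up numeric attribute.
toy = [("ACGTGA", 9.0), ("TTGTGC", 8.5), ("ACCTAA", 1.0), ("TTCCAA", 1.5)]
pattern, score = best_substring_pattern(toy)
print(pattern, score)
```

Several patterns can induce the same split and therefore the same score; the sketch keeps the lexicographically smallest one. The cost grows with the number of candidate substrings times the data size, which is why more refined algorithms are needed for realistic inputs.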


Related resources

Practical Algorithms for Pattern Based Linear Regression

We consider the problem of discovering the optimal pattern from a set of strings and associated numeric attribute values. The goodness of a pattern is measured by the correlation between the number of occurrences of the pattern in each string, and the numeric attribute value assigned to the string. We present two algorithms based on suffix trees, that can find the optimal substring pattern in O...


Crochemore's String Matching Algorithm: Simplification, Extensions, Applications

We address the problem of string matching in the special case where the pattern is very long. First, constant extra space algorithms are desirable with long patterns, and we describe a simplified version of Crochemore’s algorithm retaining its linear time complexity and constant extra space usage. Second, long patterns are unlikely to occur in the text at all. Thus we define a generalization of...


Application of Pattern Recognition Algorithms for Clustering Power System to Voltage Control Areas and Comparison of Their Results

Finding the collapse-susceptible portion of a power system is one of the purposes of voltage stability analysis. This part, which is a voltage control area, is called the voltage weak area. Determining the weak area and adjacent voltage control areas is especially important for improving voltage stability. Designing an on-line corrective control requires the voltage weak area to be determi...


Large-Scale Regression-Based Pattern Discovery in International Adverse Drug Reaction Surveillance

This paper demonstrates the first use of shrinkage logistic regression as a pattern discovery method in international adverse drug reaction surveillance. This novel method is compared to bivariate pattern discovery, the standard approach in the application domain. Our results show that regression can eliminate false positives and false negatives due to the impact of other covariates, and that i...


Journal:
  • Genome informatics. International Conference on Genome Informatics

Volume 13

Publication year: 2002